Store Sales - Time Series Forecasting

Author

Ben Elpidius, Danyili Hong, Honglei Yang, John Ju

Load Packages

Code
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
import pandas as pd
import plotly.express as px
from sklearn.metrics import accuracy_score
import plotly.io as pio
from sklearn.ensemble import HistGradientBoostingClassifier
import plotly.graph_objs as go
import math

Overview

Theme

The theme of this Kaggle competition is forecasting grocery sales in Ecuador. What makes it interesting is that, beyond basic store and turnover data for local stores, it also includes data on Ecuador's oil prices, holidays, and events. These datasets are not directly part of the grocery sales data but do affect sales, which makes the project both intriguing and challenging. In this project, we will examine how these external factors affect the transactions and sales of grocery stores.

Why

We chose this Kaggle competition because the results would be helpful for business and store owners. Knowing when to open a store, and what sales and transactions to expect under different external conditions, is valuable. Store owners also need sales and transaction forecasts to decide when to restock or buy items, and when to close the store because little is selling.

Question

How is the trend of transactions in Ecuador changing over time, how is it affected by external factors such as oil prices and holidays, and how does this impact total store sales? We aim to explore how these external factors have influenced local store transactions and attempt to predict their impact. If possible, we will compare forecasts with and without these external factors to see the difference, and predict sales after 2017 (2017-2018).

Dataset

The dataset consists of four parts. stores.csv contains basic information about local stores, including store number, city and state location, and the type and cluster of each store. holidays_events.csv covers all holidays and events from 2012 to 2017, categorized as local, regional, or national. oil.csv records Ecuador's daily oil price, a significant factor in the local economy. Lastly, transactions.csv details the daily transaction count for each store, identified by its number. The competition also provides its own train and test datasets. We will use those to forecast past sales, since the sales column is computed by the organizers, and we will also build our own train and test splits to predict future transactions.

Prior Work

Summary of Findings

EKREM BAYAR: https://www.kaggle.com/code/ekrembayar/store-sales-ts-forecasting-a-comprehensive-guide

The project covers sales for each product-family and store combination, transactions over the years, interpolation of oil prices, data manipulation for the holiday and events data, and the effects of outside events such as earthquakes.

HOWOO JANG: https://www.kaggle.com/code/howoojang/first-kaggle-notebook-following-ts-tutorial

The project covers the relationships between sales, transactions, and oil prices, along with the effects of holidays and seasonality, and forecasting with machine learning.

Critique of prior work

Ekrem Bayar
  1. What did they do well that is inspiring? They explained every step with reasonable text and plots as evidence.

  2. What could they improve on or explore further? They did a nearly perfect job on this project; the only drawback is that the dataset is so large that when interpreting a plot, we can only read the overall trend, not the details.

  3. Do you trust their results? Why or why not? Yes. As mentioned above, they provided reasonable explanations and plots to validate their results.

Howoo Jang
  1. What did they do well that is inspiring? They gave a conclusion for each step, which helps viewers follow along.

  2. What could they improve on or explore further? They did not introduce the project at the beginning and instead opened with a large amount of code, which can confuse viewers.

  3. Do you trust their results? Why or why not? Yes, because they explained the reasoning behind every chunk of code and summarized the results.

Extension

In this project, we extend prior work by comparing forecasts and models with and without external factors and by predicting future transactions. This includes shifting the transaction values down by one day to create a lagged feature.
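The one-day shift described above can be sketched with pandas on hypothetical toy data (the real data has one row per store per day):

```python
import pandas as pd

# Toy daily transaction counts (hypothetical values)
df = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=4, freq="D"),
    "transactions": [100, 120, 90, 110],
})

# shift(1) moves each value down one row, so every row carries
# the previous day's transactions as a lagged feature
df["transactions_lag1"] = df["transactions"].shift(1)
print(df)
```

The first row has no previous day, so its lag is NaN and is typically dropped before modeling.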

Approach

Problem Description

We will predict transactions for stores located in Ecuador after 2017. Using the transactions, oil price, and holiday variables, we will try to predict future transactions. Ideally we would analyze the whole dataset, but given its extensive scale, and to keep the analysis feasible and the model predictions accurate, we narrowed the data to the period from August 15, 2016, to August 15, 2017, for the first attempt. The second and third attempts use the whole dataset, which made them more complicated and less accurate.

Provenance

Link: https://www.kaggle.com/competitions/store-sales-time-series-forecasting/data?select=train.csv

The dataset used in this project originates from the Kaggle website, specifically from a competition focused on forecasting sales in Ecuadorian stores. The data spans daily observations from 2012 to 2017.

Data

Code
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
transactions = pd.read_csv("data/transactions.csv")
stores = pd.read_csv("data/stores.csv")
oil = pd.read_csv("data/oil.csv")
holidays_events = pd.read_csv("data/holidays_events.csv")

Features and Records

Code
transactions.info()
transactions.head(2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83488 entries, 0 to 83487
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          83488 non-null  object
 1   store_nbr     83488 non-null  int64 
 2   transactions  83488 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.9+ MB
date store_nbr transactions
0 2013-01-01 25 770
1 2013-01-02 1 2111

On 2013-01-01, the store with store number 25 had a total of 770 transactions.

Code
stores.info()
stores.head(2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   store_nbr  54 non-null     int64 
 1   city       54 non-null     object
 2   state      54 non-null     object
 3   type       54 non-null     object
 4   cluster    54 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 2.2+ KB
store_nbr city state type cluster
0 1 Quito Pichincha D 13
1 2 Quito Pichincha D 13

Store number 1 is in the city of Quito, Pichincha state, with store type D and cluster 13 (a cluster is a grouping of similar stores).

Code
holidays_events.info()
holidays_events.head(2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         350 non-null    object
 1   type         350 non-null    object
 2   locale       350 non-null    object
 3   locale_name  350 non-null    object
 4   description  350 non-null    object
 5   transferred  350 non-null    bool  
dtypes: bool(1), object(5)
memory usage: 14.1+ KB
date type locale locale_name description transferred
0 2012-03-02 Holiday Local Manta Fundacion de Manta False
1 2012-04-01 Holiday Regional Cotopaxi Provincializacion de Cotopaxi False

It was a local holiday on 2012-03-02 called Fundacion de Manta which is celebrated in Manta.

Code
oil.info()
oil.head(2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1218 entries, 0 to 1217
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        1218 non-null   object 
 1   dcoilwtico  1175 non-null   float64
dtypes: float64(1), object(1)
memory usage: 19.2+ KB
date dcoilwtico
0 2013-01-01 NaN
1 2013-01-02 93.14

The daily oil price wasn’t recorded on the day 2013-01-01.

Data Analysis

Merging the datasets is straightforward because the key values, date and store number, are shared across files. The date column records every single day, which makes the data detailed and useful for interpretation, but also large and complex. Some dates are missing for some store numbers, which could mean no transactions or sales were made at those stores on those days.
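The missing dates mentioned above can be detected per store by comparing the recorded dates against a full daily calendar; a minimal sketch with hypothetical rows:

```python
import pandas as pd

# Toy per-store records with a gap on 2013-01-03 (hypothetical data)
tx = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-04"]),
    "store_nbr": [1, 1, 1],
    "transactions": [770, 2111, 1500],
})

# Build the full calendar over the store's span, then take the set
# difference to find days with no recorded transactions
full_range = pd.date_range(tx["date"].min(), tx["date"].max(), freq="D")
missing = full_range.difference(tx["date"])
print(missing)
```

Applied per store_nbr (e.g. inside a groupby), this reveals which days each store has no record for.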

Data Wrangling

Code
sales = pd.merge(transactions, stores, on='store_nbr', how='outer')
sales = pd.merge(sales, holidays_events, on="date", how="outer")
sales = pd.merge(sales, oil, on="date", how="outer")
sales.head(5)
date store_nbr transactions city state type_x cluster type_y locale locale_name description transferred dcoilwtico
0 2013-01-01 25.0 770.0 Salinas Santa Elena D 1.0 Holiday National Ecuador Primer dia del ano False NaN
1 2013-01-02 25.0 1038.0 Salinas Santa Elena D 1.0 NaN NaN NaN NaN NaN 93.14
2 2013-01-02 1.0 2111.0 Quito Pichincha D 13.0 NaN NaN NaN NaN NaN 93.14
3 2013-01-02 2.0 2358.0 Quito Pichincha D 13.0 NaN NaN NaN NaN NaN 93.14
4 2013-01-02 3.0 3487.0 Quito Pichincha D 8.0 NaN NaN NaN NaN NaN 93.14

Exploratory Data Analysis (EDA)

Code
show_plot(
  px.box(
    sales,
    x = "city",
    y = "transactions"
  )
)

From this chart, we can observe that the transaction count in Quito is significantly higher than in other cities. This could be because Quito is the capital, and its residents may have greater purchasing power.

Code
header_values = ['KPI', 'Value']
cell_values = [
    ['Number of Stores', 'Number of Different Products', 
     'Window Start Date', 'Window End Date',
     '#Rows in training set', '#Date Points'], 
    [train['store_nbr'].nunique(), train['family'].nunique(), 
     train['date'].min(), train['date'].max(),
     train.shape[0], train['date'].nunique()]
]

show_plot(go.Figure(data=[go.Table(header=dict(values=header_values),
                               cells=dict(values=cell_values))]))
Code
train_aux = train[['date', 'sales', 'onpromotion']].groupby('date').mean()
train_aux = train_aux.reset_index()

show_plot(go.Figure(data=go.Scatter(x=train_aux['date'], 
                                y=train_aux['sales'],
                                marker_color='red', text="sales")).update_layout({"title": 'Avg Sales by date for all stores and products',
                   "xaxis": {"title":"Date"},
                   "yaxis": {"title":"Avg Unit Sold"},
                   "showlegend": False}))

Sales show an increasing trend. In the last two years (since July 2015) the trend has been stable (an almost stationary time series). Zooming in reveals a 7-day seasonality with the same repeating pattern: higher sales on weekends, peaking on Saturdays/Sundays. On January 1st of every year the supermarkets are closed (sales = 0).
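The 7-day seasonality claimed above can be checked with a lag-7 autocorrelation; a sketch on synthetic daily sales with a built-in weekend spike (not the competition data):

```python
import numpy as np
import pandas as pd

# Synthetic daily sales with a weekly pattern: higher on Sat/Sun
rng = np.random.default_rng(0)
days = pd.date_range("2015-01-01", periods=365, freq="D")
base = 100 + 30 * (days.dayofweek >= 5)
sales = pd.Series(base + rng.normal(0, 5, len(days)), index=days)

# A strong lag-7 autocorrelation is the signature of weekly seasonality
print(round(sales.autocorr(lag=7), 2))
```

On the real train_aux series, a similarly high lag-7 autocorrelation would confirm the weekly pattern seen in the plot.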

Code
show_plot((px.scatter(train_aux[train_aux['onpromotion'] > 0], x="onpromotion", y="sales", color='sales', 
                 color_continuous_scale="earth",
                 size='sales', log_x=True, size_max=30).update_layout({"title": 'Correlation between OnPromotion and Sales (total avg sales and promotion each day)',
                   "xaxis": {"title":"On Promotion"},
                   "yaxis": {"title":"Sales"},
                   "showlegend": False})))

From this chart, we can see that promotional tactics have a significant impact on increasing sales volume.

Code
train['year'] = pd.to_datetime(train['date']).dt.year
train['month'] = pd.to_datetime(train['date']).dt.strftime("%B")
train['day_of_week'] = pd.to_datetime(train['date']).dt.day_name()
df_year = train.groupby('year').mean(numeric_only=True)[['sales']]
df_year = df_year.reset_index()

df_month = train.groupby('month').mean(numeric_only=True)[['sales']]
new_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
df_month = df_month.reindex(new_order, axis=0)
df_month = df_month.reset_index()

df_day_of_week = train.groupby('day_of_week').mean(numeric_only=True)[['sales']]
new_order_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
df_day_of_week = df_day_of_week.reindex(new_order_week, axis=0)
df_day_of_week = df_day_of_week.reset_index()

show_plot( px.bar(df_year, x='year', y='sales', color='sales', color_continuous_scale="darkmint")
                    .update_layout({"title": f'AVG SALES BY YEAR',
                   "xaxis": {"title":"YEAR"},
                   "yaxis": {"title":"Avg Unit Sold"},
                   "showlegend": False}))

From this chart, we can see that from 2013 to 2017, the overall sales volume trend has been steadily increasing.

Code
show_plot( px.bar(df_month, x='month', y='sales', color='sales', color_continuous_scale="puor").update_layout({"title": f'AVG SALES BY MONTH',
                   "xaxis": {"title":"MONTH"},
                   "yaxis": {"title":"Avg Unit Sold"},
                   "showlegend": False}))

From this chart, we can see that the sales volume in December is significantly higher than in other months. This could be due to major holidays contributing to increased purchases, such as Christmas.

Code
show_plot(px.bar(df_day_of_week, x='day_of_week', y='sales', color='sales', color_continuous_scale="speed")
          .update_layout({"title": 'AVG SALES BY DAY OF WEEK',
                          "xaxis": {"title":"DAY OF WEEK"},
                          "yaxis": {"title":"Avg Unit Sold"},
                          "showlegend": False}))

We can see that the sales volume on weekends is significantly higher than during the weekdays.

Code
previous_price = 93.14
for index, row in oil.iterrows():
    if math.isnan(row['dcoilwtico']):
        oil.loc[oil['date'] == row['date'], 'dcoilwtico'] = previous_price
    else:
        previous_price = row['dcoilwtico']
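The loop above carries the last observed price forward; pandas' built-in forward fill achieves the same result, sketched here on toy prices (the 93.14 seed matches the loop's starting value):

```python
import pandas as pd

# Toy oil prices with gaps, mirroring the structure of oil.csv
oil_demo = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04"],
    "dcoilwtico": [None, 93.14, None, 92.97],
})

# Seed the leading NaN with the loop's starting price, then
# ffill() carries each last non-null value forward
oil_demo["dcoilwtico"] = oil_demo["dcoilwtico"].fillna({0: 93.14}).ffill()
print(oil_demo["dcoilwtico"].tolist())
```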
Code
train_aux = train[['date', 'sales']].groupby('date').mean()
train_aux = train_aux.reset_index()
show_plot(go.Figure(data=go.Scatter(x=oil['date'], 
                                y=oil['dcoilwtico'],
                                marker_color='blue', text="sales")).update_layout({"title": f'Oil Prices Chart',
                   "xaxis": {"title":"Date"},
                   "yaxis": {"title":"Oil Price"},
                   "showlegend": False}))

From the chart, we can observe that oil prices were relatively stable from 2013 to 2014. However, there was a significant drop in oil prices from 2014 to 2015. This downward trend continued into 2016. Starting from 2016, oil prices began to stabilize again, but they were still considerably lower compared to the levels in 2013.

Code
sales_oil = train.groupby('date').mean(numeric_only=True)['sales']
sales_oil = sales_oil.reset_index()
sales_oil = pd.merge(sales_oil, oil, on ='date', how='left')

# we don't have all the oil prices available, we impute them 
previous_price = 93.14
for index, row in sales_oil.iterrows():
    if math.isnan(row['dcoilwtico']):
        sales_oil.loc[sales_oil['date'] == row['date'], 'dcoilwtico'] = previous_price
    else: 
        previous_price = row['dcoilwtico']
Code
show_plot(px.scatter(sales_oil, x="dcoilwtico", y="sales", size='sales', color='sales',
                  color_continuous_scale="pinkyl").update_layout({"title": f'Correlation between Oil Prices and Total Average Sales',
                   "xaxis": {"title":"Oil Price"},
                   "yaxis": {"title":"Sales"},
                   "showlegend": False}))

We observed that higher oil prices correspond to a significant decrease in sales values.

Code
show_plot(
  px.line(
    sales,
    x = "date",
    y = "transactions",
   facet_col = "type_x",
   facet_col_wrap = 2
  )
)

Types A and D seem to have higher transactions, but we are not sure what these types mean because they are not described on Kaggle.

Summary

Each EDA graph shows a clear trend, whether increasing or decreasing. Sales and transactions increase year over year and when a positive external factor applies, such as a promotion, holiday, or season. Sales and transactions decrease when the oil price rises, which makes sense since higher oil prices make products more expensive.

Modeling

Code
def evaluate(y_true, y_pred):
    return pd.Series({
        'MAE': mean_absolute_error(y_true, y_pred),
        'MAPE': mean_absolute_percentage_error(y_true, y_pred),
        # 'MSE': mean_squared_error(y_true, y_pred),
    })

The variable we are trying to predict is transactions. We will attempt to train models in three different ways.
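As a sanity check of the metrics, here is the evaluate helper applied to a hypothetical actual/predicted pair (the function is repeated so the snippet runs standalone):

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

def evaluate(y_true, y_pred):
    return pd.Series({
        'MAE': mean_absolute_error(y_true, y_pred),
        'MAPE': mean_absolute_percentage_error(y_true, y_pred),
    })

# Hypothetical actual vs. predicted transactions
print(evaluate([100, 200, 400], [110, 190, 360]))
# MAE = (10 + 10 + 40) / 3 = 20; MAPE = (0.10 + 0.05 + 0.10) / 3 ≈ 0.083
```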

First attempt

Code
transactions_query = sales.query("'2016-08-15' <= date <= '2017-08-15'")
transactions_query.head(5)
date store_nbr transactions city state type_x cluster type_y locale locale_name description transferred dcoilwtico
65412 2016-08-15 25.0 840.0 Salinas Santa Elena D 1.0 Holiday Local Riobamba Fundacion de Riobamba False 45.72
65413 2016-08-15 1.0 1776.0 Quito Pichincha D 13.0 Holiday Local Riobamba Fundacion de Riobamba False 45.72
65414 2016-08-15 2.0 1789.0 Quito Pichincha D 13.0 Holiday Local Riobamba Fundacion de Riobamba False 45.72
65415 2016-08-15 3.0 2938.0 Quito Pichincha D 8.0 Holiday Local Riobamba Fundacion de Riobamba False 45.72
65416 2016-08-15 4.0 1348.0 Quito Pichincha D 9.0 Holiday Local Riobamba Fundacion de Riobamba False 45.72

The first approach directly used data from August 15, 2016, to August 15, 2017. We merged the original files ourselves, including stores.csv, oil.csv, holidays_events.csv, and transactions.csv; the resulting dataframe was not based on the train and test data provided by the competition. In this first attempt we did not incorporate time lags.

Setup
Code
#Selecting specific columns and removing missing values
new_transactions = transactions_query[['date', 'store_nbr', 'transactions']]
new_transactions = new_transactions.dropna()

#Define target variable and features
target = "transactions"
features = [i for i in new_transactions.columns if i != target]

train_original, test_original = train_test_split(new_transactions, test_size=0.2, random_state=42)
print(f"train_original: {len(train_original)} transactions, test_original: {len(test_original)} transactions")

train_original['date'] = pd.to_datetime(train_original['date'])
train_original['date'] = train_original['date'].apply(lambda x: x.timestamp())

test_original['date'] = pd.to_datetime(test_original['date'])
test_original['date'] = test_original['date'].apply(lambda x: x.timestamp())
train_original: 15639 transactions, test_original: 3910 transactions

We use date and store_nbr as features because we want the total transactions of each store per date, and we use MAE and MAPE to measure accuracy. We expect the three models to produce reasonable MAE and MAPE values. We chose linear regression, a random forest regressor, and a decision tree regressor because our target is continuous.

Linear Regression
Code
lr = LinearRegression().fit(
    X=train_original[features],
    y=train_original[target]
)

train_original['lr'] = lr.predict(train_original[features])
evaluate(train_original[target], train_original['lr'])

test_original['lr'] = lr.predict(test_original[features])
evaluate(test_original[target], test_original['lr'])
MAE     726.517981
MAPE      0.534356
dtype: float64
Random Forest Regressor
Code
rf = RandomForestRegressor(max_depth=25).fit(
    X=train_original[features],
    y=train_original[target]
)
train_original['rf'] = rf.predict(train_original[features])
evaluate(train_original[target], train_original['rf'])

test_original['rf'] = rf.predict(test_original[features])
evaluate(test_original[target], test_original['rf'])
MAE     177.837771
MAPE      0.111316
dtype: float64
Decision Tree Regressor
Code
dr = DecisionTreeRegressor(max_depth=25).fit(
    X=train_original[features],
    y=train_original[target]
)
train_original['dr'] = dr.predict(train_original[features])
evaluate(train_original[target], train_original['dr'])

test_original['dr'] = dr.predict(test_original[features])
evaluate(test_original[target], test_original['dr'])
MAE     204.110751
MAPE      0.126474
dtype: float64

We merely adjusted the maximum depth of the Random Forest to achieve better MAE and MAPE values. Currently the Random Forest Regressor performs best, with the lowest test MAE of about 177.8. However, the training-set MAPE is only about 0.05 while the test-set MAPE is 0.11, a gap that indicates a potential risk of overfitting. Next, we proceed with cross-validation of these models.
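The overfitting risk noted above can be illustrated by watching the train/test MAE gap grow with tree depth; a sketch on synthetic data standing in for the store features (a single decision tree is used here for determinism, but forests behave similarly):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: stands in for date/store features)
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(2000, 2))
y = 1000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 50, 2000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# A growing train/test gap as depth increases is the overfitting signature
for depth in (3, 25):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    gap = (mean_absolute_error(y_te, tree.predict(X_te))
           - mean_absolute_error(y_tr, tree.predict(X_tr)))
    print(f"max_depth={depth}: train/test MAE gap = {gap:.1f}")
```

A small gap at low depth and a large one at high depth shows the model memorizing the training set; cross-validation gives a more reliable error estimate.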

Cross Validation
Code
# Cross-validation of Random Forest Model
# Reprocess the data to ensure that the date column is converted to a timestamp
new_transactions['date'] = pd.to_datetime(new_transactions['date'])
new_transactions['date'] = new_transactions['date'].apply(lambda x: x.timestamp())

# Update the feature list, ensuring that the target column is not included
features = [i for i in new_transactions.columns if i != target]

# perform cross-validation
rf_scores = cross_val_score(RandomForestRegressor(max_depth=25), X=new_transactions[features], y=new_transactions[target], cv=5, scoring='neg_mean_absolute_error')
rf_mae = -rf_scores.mean()
print(f"Random Forest MAE: {rf_mae}")
Random Forest MAE: 277.64885674827
Code
# Cross-validation of decision tree models
dr_scores = cross_val_score(DecisionTreeRegressor(max_depth=25), X=new_transactions[features], y=new_transactions[target], cv=5, scoring='neg_mean_absolute_error')
dr_mae = -dr_scores.mean()
print(f"Decision Tree MAE: {dr_mae}")
Decision Tree MAE: 296.957056256547
Visualization of Model
Code
# Compare the two model
test_model = test_original.melt(
    id_vars=['transactions'] + features,
    value_vars=['dr', 'rf'],
    var_name='model',
    value_name='prediction')

test_model['model'] = test_model['model'].replace({
    'dr': 'Decision Regressor',
    'rf': 'Random Regressor',
})

test_model['resid'] = test_model['transactions'] - test_model['prediction']
test_model.head()
transactions date store_nbr model prediction resid
0 3590.0 1.482797e+09 48.0 Decision Regressor 3250.0 340.0
1 1217.0 1.475798e+09 5.0 Decision Regressor 1184.0 33.0
2 1356.0 1.478995e+09 53.0 Decision Regressor 1274.0 82.0
3 809.0 1.492646e+09 13.0 Decision Regressor 819.0 -10.0
4 648.0 1.492214e+09 30.0 Decision Regressor 625.0 23.0
Code
show_plot(
  px.scatter(
    test_model,
    x="transactions",
    y="prediction",
    facet_col="model"
  )
  .update_traces(
      marker_size=3,
  )
)

The outcomes from the two predictive models differed, with the Random Forest Regressor showing superior performance: MAE 67.768521 and MAPE 0.047034 on the training set, versus MAE 177.656220 and MAPE 0.111345 on the test set.

After cross-validation, the Random Forest MAE is roughly 277, noticeably higher than the single train/test split suggested.

Second attempt

In the second attempt, we incorporated a time lag into the original transaction data provided by the competition. We again chose three models: linear regression, decision tree, and random forest. The features consist of the newly added lagged transactions, and the target remains the transaction count.

Lagged Features
Code
# Lagging Feature first try
# Train the model with data lagged/shift by one day

# tidy data
transactions['date'] = pd.to_datetime(transactions['date'])
transactions.sort_values(by='date', inplace=True)

# Create a lagged feature of 1 day
transactions['transactions_lag1'] = transactions['transactions'].shift(1)

# Remove rows containing null values
transactions.dropna(inplace=True)

# Define Features and Target Variables
X_lag_transaction_day = transactions[['transactions_lag1']]  # feature variable
y_lag_transaction_day = transactions['transactions']         # target variable

# splitting the dataset
X_train_lag_transaction_day, X_test_lag_transaction_day, y_train_lag_transaction_day, y_test_lag_transaction_day = train_test_split(X_lag_transaction_day, y_lag_transaction_day, test_size=0.2, random_state=42)
Linear Regression Model
Code
# Create model
try_model = LinearRegression()

# Train model
try_model.fit(X_train_lag_transaction_day, y_train_lag_transaction_day)

# Conduct predictions
predictions = try_model.predict(X_test_lag_transaction_day)

# Evaluation of Linear Regression Model on the Test Set
try_lr_eval = evaluate(y_test_lag_transaction_day, predictions)
mse = mean_squared_error(y_test_lag_transaction_day, predictions)
print(f'Linear Regression Test Evaluation \n{try_lr_eval}\n')

# Evaluation of Linear Regression Model on the Training Set
train_predictions_lag = try_model.predict(X_train_lag_transaction_day)
train_lr_eval_lag = evaluate(y_train_lag_transaction_day, train_predictions_lag)
train_mse_lag = mean_squared_error(y_train_lag_transaction_day, train_predictions_lag)

print(f'Linear Regression Training Evaluation (Lagged Features):\n{train_lr_eval_lag}\n')
Linear Regression Test Evaluation 
MAE     607.888349
MAPE      0.458032
dtype: float64

Linear Regression Training Evaluation (Lagged Features):
MAE     617.495191
MAPE      0.456793
dtype: float64
Decision Tree Model
Code
# Create Decision Tree Model
dt_model_lag = DecisionTreeRegressor(max_depth=5, min_samples_leaf=3)

# Train model
dt_model_lag.fit(X_train_lag_transaction_day, y_train_lag_transaction_day)

# Conduct predictions
dt_predictions_test = dt_model_lag.predict(X_test_lag_transaction_day)
dt_predictions_train = dt_model_lag.predict(X_train_lag_transaction_day)

# Evaluation of Decision Tree Model on the Test Set
dt_eval_test = evaluate(y_test_lag_transaction_day, dt_predictions_test)
print(f'Decision Tree Test Evaluation \n{dt_eval_test}\n')

# Evaluation of Decision Tree Model on Training Set
dt_eval_train = evaluate(y_train_lag_transaction_day, dt_predictions_train)
print(f'Decision Tree Training Evaluation \n{dt_eval_train}\n')
Decision Tree Test Evaluation 
MAE     584.448378
MAPE      0.449514
dtype: float64

Decision Tree Training Evaluation 
MAE     592.12036
MAPE      0.44528
dtype: float64
Random Forest Model
Code
# Create a Random Forest Model
rf_model_lag = RandomForestRegressor(n_estimators=100, max_depth=30)

# Train model
rf_model_lag.fit(X_train_lag_transaction_day, y_train_lag_transaction_day)

# Conduct predictions
rf_predictions_test = rf_model_lag.predict(X_test_lag_transaction_day)
rf_predictions_train = rf_model_lag.predict(X_train_lag_transaction_day)

# Evaluation of Random Forest Model on the Test Set
rf_eval_test = evaluate(y_test_lag_transaction_day, rf_predictions_test)
print(f'Random Forest Test Evaluation \n{rf_eval_test}\n')

# Evaluation of Random Forest Model on the Training Set
rf_eval_train = evaluate(y_train_lag_transaction_day, rf_predictions_train)

print(f'Random Forest Training Evaluation \n{rf_eval_train}\n')
Random Forest Test Evaluation 
MAE     606.791550
MAPE      0.463317
dtype: float64

Random Forest Training Evaluation 
MAE     569.508599
MAPE      0.428687
dtype: float64

We can see that the Mean Absolute Errors (MAE) of the three models all fall roughly between 570 and 620, while the Mean Absolute Percentage Errors (MAPE) sit between 0.42 and 0.47, which is high. The results are not ideal, so we incorporate more complex and diverse data in the third attempt.
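Before adding external data, one common extension is richer lags; a sketch building both a 1-day and a 7-day lag on a toy series (the 7-day lag captures the weekly pattern seen in the EDA):

```python
import pandas as pd

# Toy daily totals; in the real data these would be transaction counts
s = pd.Series(range(100, 110), index=pd.date_range("2017-01-01", periods=10))

# Several lags give the model both yesterday's value and the value
# from the same weekday last week
lags = pd.DataFrame({f"lag_{k}": s.shift(k) for k in (1, 7)})
print(lags.dropna())
```

Rows without a full week of history are dropped, so the usable sample shrinks by the longest lag.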

Visualization of Model
Code
# Merge Predicted Results
test_results = X_test_lag_transaction_day.copy()
test_results['Actual_Transactions'] = y_test_lag_transaction_day
test_results['LR_Predictions'] = try_model.predict(X_test_lag_transaction_day)
test_results['DT_Predictions'] = dt_model_lag.predict(X_test_lag_transaction_day)
test_results['RF_Predictions'] = rf_model_lag.predict(X_test_lag_transaction_day)
Code
show_plot(px.scatter(
    test_results,
    x='Actual_Transactions',
    y=['LR_Predictions', 'DT_Predictions', 'RF_Predictions'],
    labels={'value': 'Predicted Transactions'},
    title='Comparison of Actual vs. Predicted Transactions'
))
Code
# Merge the prediction results of each model into a DataFrame
test_model = test_results.melt(id_vars=['Actual_Transactions'], 
                               value_vars=['LR_Predictions', 'DT_Predictions', 'RF_Predictions'],
                               var_name='Model', 
                               value_name='Predictions')

# Map the model names to a more readable format
model_names = {
    'LR_Predictions': 'Linear Regression',
    'DT_Predictions': 'Decision Tree',
    'RF_Predictions': 'Random Forest'
}
test_model['Model'] = test_model['Model'].map(model_names)

# Create Scatter Plot
show_plot(px.scatter(
    test_model,
    x='Actual_Transactions',
    y='Predictions',
    facet_col='Model',
    color='Model',
    labels={'Predictions': 'Predicted Transactions'},
    title='Comparison of Actual vs. Predicted Transactions'
).update_traces(marker=dict(size=3))
)

Third attempt

We have incorporated data on oil prices, merging transactions from various stores into the total national transaction volume within a day. Lag features for each month of the year and each day have been added. We also cleaned up some rows with NaN values. In contrast to the first attempt, we utilized data from 2013 to 2017, anticipating that such data would provide better training material.

Prepare Data
Code
# Adding lag data
# Remove all categorical data from sales, only leaving quantitative data
quantitative_sales = sales.filter(['date', 'transactions', 'dcoilwtico'])

# Ensure the date column is in datetime format
quantitative_sales['date'] = pd.to_datetime(quantitative_sales['date'])

# Sum transactions and average dcoilwtico to get the total transactions for each day nationwide
quantitative_sales = quantitative_sales.groupby('date').agg({
    'transactions': 'sum',    # Sum the transactions column
    'dcoilwtico': 'mean'      # Average the dcoilwtico column
}).reset_index()

quantitative_sales = quantitative_sales[
    (quantitative_sales['date'] >= '2013-01-02') & 
    (quantitative_sales['date'] <= '2017-08-15')
]

# Use linear interpolation to fill in NaN values in the oil price (dcoilwtico)
quantitative_sales['dcoilwtico'] = quantitative_sales['dcoilwtico'].interpolate(method='linear')

# Create a year column
quantitative_sales['year'] = quantitative_sales['date'].dt.year

# Create a month column
quantitative_sales['month'] = quantitative_sales['date'].dt.month

# Create a column for the year and the day of the year
quantitative_sales['year'] = quantitative_sales['date'].dt.year
quantitative_sales['day_of_year'] = quantitative_sales['date'].dt.dayofyear

# Create a lag feature for the same month each year and use linear interpolation to handle NaN values
quantitative_sales['lag_same_month'] = quantitative_sales.groupby(['year', 'month'])['transactions'].shift(1)
quantitative_sales['lag_same_month'] = quantitative_sales['lag_same_month'].interpolate(method='linear')

# Apply a one-day lag to the entire time series and use forward fill to handle NaN values
quantitative_sales['lag_transactions'] = quantitative_sales['transactions'].shift(1)
quantitative_sales['lag_transactions'] = quantitative_sales['lag_transactions'].ffill()

# Drop any remaining rows that still contain NaN (e.g. the first lagged row)
quantitative_sales = quantitative_sales.dropna()

# Remove the columns of year, month, and day_of_year to prevent them from impacting the model
quantitative_sales = quantitative_sales.drop(['year', 'month', 'day_of_year'], axis=1)

# Try removing the oil price (dcoilwtico) column to experiment with the model
#quantitative_sales = quantitative_sales.drop(['dcoilwtico'], axis=1)

During troubleshooting, we suspected that the oil price might be acting as noise for the model, so partway through we tried removing the oil price data. This brought no improvement, so the line was commented out.
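A more direct way to check whether a feature such as `dcoilwtico` behaves as noise, rather than simply dropping the column, is permutation importance: shuffle one feature at a time and measure how much the model's score drops. A self-contained sketch on synthetic data (the arrays below are illustrative stand-ins, not the notebook's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)   # informative feature
noise = rng.normal(size=n)    # pure-noise feature (stand-in for a suspected-noise column)
X = np.column_stack([signal, noise])
y = 3.0 * signal + rng.normal(scale=0.1, size=n)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Shuffling the informative feature destroys the score; shuffling the noise barely matters
print(result.importances_mean)
```

A feature whose permutation importance is near zero contributes little to the model and can usually be dropped without hurting accuracy.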

Modeling
Code
# Adding holiday data
# Create a new column and initialize it to False
holidays_events['is_holiday'] = False

# Update the 'is_holiday' column
for index, row in holidays_events.iterrows():
    if row['type'] == 'Holiday' and not row['transferred']:
        holidays_events.at[index, 'is_holiday'] = True
    elif row['type'] in ['Transfer', 'Additional']:
        holidays_events.at[index, 'is_holiday'] = True

# Keep the required columns (copy to avoid SettingWithCopyWarning downstream)
holidays = holidays_events[['date', 'is_holiday']].copy()

# Merge the holiday data with quantitative_sales
quantitative_sales['date'] = pd.to_datetime(quantitative_sales['date'])
holidays['date'] = pd.to_datetime(holidays['date'])
quantitative_sales = pd.merge(quantitative_sales, holidays, on='date', how='left')

# Fill in the missing 'is_holiday' values as False after merging
quantitative_sales['is_holiday'] = quantitative_sales['is_holiday'].fillna(False)

When training the model, omitting the boolean holiday flag made it less accurate on dates with extreme trading volumes. We therefore organized the holiday data and extracted it as a feature variable for training the model.
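The row-by-row loop above can also be written in vectorized form, which sidesteps both `iterrows` and the copy-on-slice warning. A self-contained sketch on a toy stand-in for `holidays_events`:

```python
import pandas as pd

# Toy stand-in for the competition's holidays_events table
holidays_events = pd.DataFrame({
    'date': ['2017-01-01', '2017-01-02', '2017-04-14'],
    'type': ['Holiday', 'Transfer', 'Work Day'],
    'transferred': [True, False, False],
})

# Vectorized equivalent of the loop: a day counts as a holiday if it is
# a non-transferred 'Holiday', or a 'Transfer'/'Additional' day
is_real_holiday = (holidays_events['type'] == 'Holiday') & ~holidays_events['transferred']
is_moved_holiday = holidays_events['type'].isin(['Transfer', 'Additional'])
holidays_events['is_holiday'] = is_real_holiday | is_moved_holiday

holidays = holidays_events[['date', 'is_holiday']].copy()  # .copy() avoids the view warning
print(holidays['is_holiday'].tolist())  # [False, True, False]
```

The boolean masks evaluate over the whole column at once, so this scales to the full table without a Python-level loop.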

Code
# Attempt to narrow down the data to see if it improves model performance
# Select data between 2016-08-14 and 2017-08-14
# quantitative_sales = quantitative_sales[(quantitative_sales['date'] >= pd.Timestamp('2016-08-14')) & (quantitative_sales['date'] <= pd.Timestamp('2017-08-14'))]

While experimenting with the models later on, we tried a shorter time window to see whether it would improve performance. In practice there was little improvement, so we commented it out.

Code
# Transactions on Christmas and New Year's Day are extremely small, so a
# boolean column is used to mark these days as special cases
quantitative_sales['is_christmas_or_newyear'] = quantitative_sales['date'].apply(
    lambda x: (x.month == 12 and x.day == 25) or (x.month == 1 and x.day == 1)
)

Two extreme data points were tagged with boolean values. Since the transaction volumes on December 25th and January 1st are essentially zero every year, boolean flags mark these two extreme cases in an attempt to improve model performance.

Preparing data for the models
Code
# Preparing data
X = quantitative_sales[['lag_same_month', 'lag_transactions', 'dcoilwtico', 'is_holiday', 'is_christmas_or_newyear']]  # features
y = quantitative_sales['transactions']  # target

# Splitting data chronologically (shuffle=False keeps the time order, so no random_state is needed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
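The `evaluate` helper used in the cells below is defined earlier in the notebook. Judging from the MAE/MAPE Series it prints, a plausible minimal sketch (an assumption on our part, not necessarily the original definition) is:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

def evaluate(y_true, y_pred):
    """Return MAE and MAPE as a pandas Series, matching the printed output format."""
    return pd.Series({
        'MAE': mean_absolute_error(y_true, y_pred),
        'MAPE': mean_absolute_percentage_error(y_true, y_pred),
    })

print(evaluate([100, 200, 300], [110, 190, 300]))  # MAE ≈ 6.67, MAPE = 0.05
```

If the original helper differs, only the formatting should change; the MAE and MAPE values themselves come straight from scikit-learn.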
Linear Regression Model
Code
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_predictions = lr_model.predict(X_test)
lr_eval = evaluate(y_test, lr_predictions)
print(f'Linear Regression Test Evaluation:\n{lr_eval}\n')

lr_train_predictions = lr_model.predict(X_train)
lr_train_eval = evaluate(y_train, lr_train_predictions)
print(f'Linear Regression Training Evaluation:\n{lr_train_eval}\n')
Linear Regression Test Evaluation:
MAE     1.250002e+04
MAPE    3.388676e+17
dtype: float64

Linear Regression Training Evaluation:
MAE     1.215187e+04
MAPE    3.011712e+17
dtype: float64
Decision Tree Model
Code
dt_model = DecisionTreeRegressor(max_depth=5, min_samples_leaf=3)
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
dt_eval = evaluate(y_test, dt_predictions)
print(f'Decision Tree Regressor Test Evaluation:\n{dt_eval}\n')

dt_train_predictions = dt_model.predict(X_train)
dt_train_eval = evaluate(y_train, dt_train_predictions)
print(f'Decision Tree Training Evaluation:\n{dt_train_eval}\n')
Decision Tree Regressor Test Evaluation:
MAE     11716.169815
MAPE        0.270153
dtype: float64

Decision Tree Training Evaluation:
MAE     9.717719e+03
MAPE    3.861322e+15
dtype: float64
Random Forest Model
Code
rf_model = RandomForestRegressor(n_estimators=1, max_depth=15)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_eval = evaluate(y_test, rf_predictions)
print(f'Random Forest Test Evaluation:\n{rf_eval}\n')

rf_train_predictions = rf_model.predict(X_train)
rf_train_eval = evaluate(y_train, rf_train_predictions)
print(f'Random Forest Training Evaluation:\n{rf_train_eval}\n')
Random Forest Test Evaluation:
MAE     14389.142851
MAPE        0.275480
dtype: float64

Random Forest Training Evaluation:
MAE     5311.977238
MAPE       0.064005
dtype: float64

It can be observed that on this large and complex dataset, the linear regression model's Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) reach enormous values. The decision tree model's test-set MAPE is only 0.27, but its training-set MAPE is astronomically large, indicating insufficient performance. The random forest model, while only marginally usable, has a test-set MAPE of roughly 0.25 to 0.31 against a training-set MAPE of 0.05 to 0.06, a textbook case of overfitting. Unfortunately, despite adjusting the maximum depth, the changes show no clear pattern, and a maximum depth of 15 appears to yield the best results at present.
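The astronomically large MAPE values are most likely an artifact of the metric itself: MAPE divides each error by the actual value, and on closed-store days the actual transaction count is zero (scikit-learn substitutes a tiny epsilon for the zero denominator, which inflates the mean to the 1e15-1e17 range seen above). A small demonstration:

```python
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

y_true = [50_000, 60_000, 0]    # one closed-store day with zero transactions
y_pred = [51_000, 59_000, 500]  # a modest absolute error on that day

print(mean_absolute_error(y_true, y_pred))             # ~833: looks reasonable
print(mean_absolute_percentage_error(y_true, y_pred))  # enormous: division by ~0
```

This suggests the huge MAPE scores reflect the zero-transaction days rather than a uniformly bad model, and that MAE is the more trustworthy metric here.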

Cross-validation
Code
# cross-validation of Linear Regression Models
# linear regression model's MAE
lr_mae_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='neg_mean_absolute_error')
lr_mae_avg = -lr_mae_scores.mean()
print(f'Linear Regression - Avg MAE: {lr_mae_avg}')
Linear Regression - Avg MAE: 12493.208182913555
Code
# Cross-validation of Decision Tree Model
# decision tree model's MAE
dt_mae_scores = cross_val_score(DecisionTreeRegressor(max_depth=5, min_samples_leaf=3), X, y, cv=5, scoring='neg_mean_absolute_error')
dt_mae_avg = -dt_mae_scores.mean()
print(f'Decision Tree - Avg MAE: {dt_mae_avg}')
Decision Tree - Avg MAE: 11585.983434735343
Code
# Cross-validation of Random Forest model
# Random Forest model's MAE
rf_mae_scores = cross_val_score(RandomForestRegressor(n_estimators=1, max_depth=30), X, y, cv=5, scoring='neg_mean_absolute_error')
rf_mae_avg = -rf_mae_scores.mean()
print(f'Random Forest - Avg MAE: {rf_mae_avg}')
Random Forest - Avg MAE: 13844.625072886298

We attempted cross-validation, but it did not meaningfully change the models' MAE scores. So far, the various adjustments we have tried have not fundamentally changed the performance of the trained random forest model. We suspect that using too much data to train the model increased its complexity, and this attempt should be counted among our limitations.
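One caveat with the plain `cv=5` used above: for a time series, ordinary K-fold trains some folds partly on data from after the validation period, leaking future information. scikit-learn's `TimeSeriesSplit` avoids this by always validating on a segment that follows its training data. A minimal sketch on synthetic data (the arrays here are illustrative, not the notebook's features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
n = 300
t = np.arange(n)
X = np.column_stack([t, rng.normal(size=n)])  # trend feature + noise feature
y = 0.5 * t + rng.normal(size=n)              # trending series with noise

# Each split trains on an initial segment and validates on the segment right after it
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(LinearRegression(), X, y, cv=tscv,
                         scoring='neg_mean_absolute_error')
print(-scores.mean())
```

Swapping `cv=5` for a `TimeSeriesSplit` instance in the cross-validation cells above would give a fairer estimate of out-of-sample error, though it would not by itself fix the overfitting.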

Findings

In this project, our goal was to use time series analysis to forecast sales trends and predict transaction trends of stores in Ecuador. The dataset came from a Kaggle competition on predicting sales in Ecuadorian stores, with data ranging from 2012 to 2017. In our first attempt, given the large scale of the dataset, we limited the date range to August 15, 2016 through August 15, 2017 to keep the analysis feasible and the model predictions accurate. Sales figures appear only in the original training dataset provided; since we chose not to use that training data, we used transaction data to generate a training set for prediction. For model building, we opted for random forest regression and decision tree regression as our primary predictive tools. The two models produced similar results, with random forest regression performing slightly better.

In the second attempt, we started experimenting with incorporating time lags into the original transaction data provided by the competition, which used the whole dataset from 2012 to 2017. We selected three models for our analysis: linear regression, decision trees, and random forests, with the newly added lagged transaction data as our feature. The objective remained the prediction of transactions. However, the results for all three models were not ideal, with the Mean Absolute Error (MAE) ranging around 400-600 and the Mean Absolute Percentage Error (MAPE) between 0.4 and 0.6. To improve our results, we decided to add more complex and diverse data in our third attempt.

In this third attempt, we included oil price data and combined transactions from various stores to represent the total national transactions per day. We added lag features for the same month each year and daily lags, while also removing rows with NaN values. Differing from our first approach, we utilized all available data from 2013 to 2017, hoping that this would provide better training material. During our model training, when trying to eliminate errors, we considered the possibility that oil prices might be acting as noise in our model. We attempted to remove the oil price data, but this didn’t lead to any significant progress, so we decided to comment it out. We also noted that excluding holiday booleans made the model less accurate due to extreme transaction volumes on specific dates. Therefore, we organized holiday data and extracted it as a feature for model training.

Subsequently, we experimented with shorter time windows to see whether they would enhance model performance, but there was no significant improvement, leading us to comment out this approach as well. We marked two extreme data points with boolean values, as transaction volumes are essentially zero on December 25th and January 1st each year, in an attempt to improve model performance. With these complex datasets, the linear regression model's MAE and MAPE reached very high values. The decision tree had a test-set MAPE of only 0.27, but an excessively large MAPE on the training set, indicating insufficient performance. The most usable model was the random forest, with a test-set MAPE of about 0.25-0.31 and a training-set MAPE of 0.05-0.06, a typical case of overfitting. Unfortunately, adjusting the maximum depth did not yield a consistent pattern of improvement, with a depth of 15 providing the best results so far. Attempts at cross-validation also failed to change the models' MAE scores significantly.

Up to this point, we’ve tried various adjustments but have been unable to fundamentally change the performance of the model trained with the random forest. We surmise that using an excessive amount of data for training increased complexity, and this should be considered a limitation of our approach.

Limitations and Social / Ethical Considerations

In our project, we focused primarily on forecasting time series sales in the Ecuadorian retail market. Our analysis revealed several key limitations. Firstly, since our dataset was mainly focused on sales data from a specific region, this could have introduced a geographical bias, limiting the applicability of our analysis to other areas. Due to the complexity of the data, our method of imputing missing oil prices might not have fully accounted for the potential impact of oil price fluctuations on sales data, which could affect the accuracy of our predictive models. In this project, we primarily used linear regression, decision trees, and random forest models, but these traditional statistical models might not be sufficient to capture the full complexity of the data, especially when considering the relationships between sales and external factors like holidays or economic indicators.

Future Directions

During the course of the project, we identified several new research directions. For instance, we found that exploring deeper relationships between sales data and factors like weather, holidays, and even socio-economic variables could be very valuable. This would not only require the collection of a broader range of data, such as weather records and macroeconomic indicators, but might also necessitate the use of more advanced analysis techniques, like deep learning models, which can more effectively capture and predict complex time series data. Additionally, comparing sales patterns between different regions or types of stores is an intriguing direction, which would require us to integrate a more diverse set of datasets and possibly involve new data integration and cleaning techniques. Through these new research directions, we hope to gain a more comprehensive understanding of consumer behavior and improve the accuracy and reliability of sales forecasts.